For this project we will be exploring publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people whose profile suggests a high probability of paying you back. We will try to create a model that helps predict this.
Lending Club had a very interesting year in 2016, so let's check out some of their data and keep the context in mind. This data is from before they even went public.
We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from here or just use the csv already provided. It's recommended you use the csv provided, as it has been cleaned of NA values.
Here is what the columns represent:
Open the loan_data.csv file and save it as a dataframe called loans.
Check the summary and structure of loans.
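A minimal sketch of these two steps, assuming loan_data.csv sits in the working directory. The fallback rows are made up for illustration (they are not taken from the real file) and only exist so the snippet still runs when the csv is missing.

```r
# Assumption: the provided csv is in the current working directory.
csv_path <- "loan_data.csv"

if (file.exists(csv_path)) {
  loans <- read.csv(csv_path)
} else {
  # Stand-in rows (made up) so the sketch runs without the file.
  loans <- read.csv(text = c(
    "fico,int.rate,purpose,not.fully.paid",
    "737,0.1189,debt_consolidation,0",
    "707,0.1071,credit_card,0",
    "682,0.1357,debt_consolidation,1"
  ))
}

# Structure: column names, types, and the first few values.
str(loans)

# Summary: per-column statistics (quartiles for numeric columns).
summary(loans)
```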
Convert the following columns to categorical data using factor()
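As a sketch of the conversion, here is factor() applied to not.fully.paid, the column used as the class label later in the project. The tiny data frame is a made-up stand-in for loans; repeat the same assignment for each column you need to convert.

```r
# Made-up stand-in for the loans data frame (illustration only).
loans <- data.frame(
  fico = c(737, 707, 682, 712),
  not.fully.paid = c(0, 0, 1, 0)
)

# Replace the numeric 0/1 column with a factor so R treats it as categorical.
loans$not.fully.paid <- factor(loans$not.fully.paid)

# The column should now show up as a Factor with 2 levels.
str(loans)
```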
Let's use ggplot2 to visualize the data!
Create a histogram of fico scores colored by not.fully.paid
Create a barplot of purpose counts, colored by not.fully.paid. Use position = "dodge" in the geom_bar() call so the bars for each class sit side by side.
Create a scatterplot of fico score versus int.rate. Does the trend make sense? Play around with the color scheme if you want.
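The three plots above could be sketched as follows. The data here is synthetic (random values with a made-up downward link between fico and int.rate), so the shapes will not match the real loans data; swap in your own loans data frame. Bin count, transparency, and theme are arbitrary choices.

```r
library(ggplot2)

# Synthetic stand-in for loans (all values are made up for illustration).
set.seed(101)
n <- 200
loans <- data.frame(
  fico = round(runif(n, 612, 827)),
  purpose = sample(c("credit_card", "debt_consolidation", "educational"),
                   n, replace = TRUE),
  not.fully.paid = factor(rbinom(n, 1, 0.2))
)
loans$int.rate <- 0.30 - 0.00025 * loans$fico + rnorm(n, sd = 0.01)

# Histogram of fico scores, filled by not.fully.paid.
fico_hist <- ggplot(loans, aes(x = fico)) +
  geom_histogram(aes(fill = not.fully.paid), color = "black",
                 bins = 40, alpha = 0.5) +
  theme_bw()

# Barplot of purpose counts, one bar per class placed side by side.
purpose_bar <- ggplot(loans, aes(x = purpose)) +
  geom_bar(aes(fill = not.fully.paid), position = "dodge") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Scatterplot of fico (y) versus int.rate (x), colored by not.fully.paid.
fico_scatter <- ggplot(loans, aes(x = int.rate, y = fico)) +
  geom_point(aes(color = not.fully.paid), alpha = 0.3) +
  theme_bw()

print(fico_hist)
print(purpose_bar)
print(fico_scatter)
```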
Load the e1071 library as shown in the lecture.
Now use the svm() function to train a model on your training set.
Get a summary of the model.
Use predict to predict new values from the test set using your model. Refer to the lecture on how to do this if you don't remember :)
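A sketch of the train, summarize, and predict steps. It assumes a 70/30 train/test split done with base R's sample(), which may differ from the splitting approach shown in the lecture, and it runs on synthetic data in place of loans. The confusion table at the end is one common way to inspect the predictions.

```r
library(e1071)

# Synthetic stand-in for loans (all values are made up for illustration).
set.seed(101)
n <- 300
loans <- data.frame(
  fico = round(runif(n, 612, 827)),
  int.rate = runif(n, 0.06, 0.21),
  not.fully.paid = factor(rbinom(n, 1, 0.2))
)

# Assumed 70/30 split using base R.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- loans[train_idx, ]
test <- loans[-train_idx, ]

# Train an SVM; with a factor response svm() performs classification.
model <- svm(not.fully.paid ~ ., data = train)
summary(model)

# Predict on the held-out rows and compare with the true labels.
predicted <- predict(model, newdata = test)
table(predicted = predicted, actual = test$not.fully.paid)
```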
You probably got some not-so-great results, with the model classifying everything into one group! Let's tune our model to try to fix this.
Use the tune() function to test out different cost and gamma values. In the lecture we showed how to do this by using train.x and train.y, but it's usually simpler to just pass a formula. Try checking out help(tune) for more details. This is the end of the project because tuning can take a long time (since it's running a bunch of different models!). Take as long or as little time with this step as you would like.
Quick hint, your tune() should look something like this:
tune.results <- tune(svm, train.x = not.fully.paid ~ ., data = train, kernel = 'radial',
                     ranges = list(cost = some.vector, gamma = some.other.vector))
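For a concrete version of that call, here is a small sketch on synthetic data. The cost and gamma grids are arbitrary illustrative values, not recommended settings, and the data frame is a made-up stand-in for your training set. On the real data, expect a larger grid to take much longer.

```r
library(e1071)

# Synthetic stand-in for the training set (values made up for illustration).
set.seed(101)
n <- 200
train <- data.frame(
  fico = round(runif(n, 612, 827)),
  int.rate = runif(n, 0.06, 0.21),
  not.fully.paid = factor(rbinom(n, 1, 0.2))
)

# Grid search over every cost/gamma combination (cross-validated by tune()).
tune.results <- tune(svm, train.x = not.fully.paid ~ ., data = train,
                     kernel = "radial",
                     ranges = list(cost = c(1, 10), gamma = c(0.1, 1)))

# Error estimate for each combination, plus the best pair found.
summary(tune.results)
tune.results$best.parameters

# The model refit with the best parameters, ready for predict().
tuned.model <- tune.results$best.model
```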